{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在上一章节，我们对赛题的数据进行了读取，并在末尾给出了两个小作业。如果你顺利完成了作业，那么你基本上对`Python`也比较熟悉了。在本章我们将使用传统机器学习算法来完成新闻分类的过程，将会结束到赛题的核心知识点。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **Task3 基于机器学习的文本分类**\n",
    "\n",
    "在本章我们将开始使用机器学习模型来解决文本分类。机器学习发展比较广，且包括多个分支，本章侧重使用传统机器学习，从下一章开始是基于深度学习的文本分类。\n",
    "\n",
    "### **学习目标**\n",
    "\n",
    "- 学会TF-IDF的原理和使用\n",
    "- 使用sklearn的机器学习模型完成文本分类\n",
    "\n",
    "### **机器学习模型**\n",
    "\n",
    "机器学习是对能通过经验自动改进的计算机算法的研究。机器学习通过历史数据**训练**出**模型**对应于人类对经验进行**归纳**的过程，机器学习利用**模型**对新数据进行**预测**对应于人类利用总结的**规律**对新问题进行**预测**的过程。\n",
    "\n",
    "\n",
    "机器学习有很多种分支，对于学习者来说应该优先掌握机器学习算法的分类，然后再其中一种机器学习算法进行学习。由于机器学习算法的分支和细节实在是太多，所以如果你一开始就被细节迷住了眼，你就很难知道全局是什么情况的。\n",
    "\n",
    "\n",
    "如果你是机器学习初学者，你应该知道如下的事情：\n",
    "\n",
    "1. 机器学习能解决一定的问题，但不能奢求机器学习是万能的；\n",
    "2. 机器学习算法有很多种，看具体问题需要什么，再来进行选择；\n",
    "3. 每种机器学习算法有一定的偏好，需要具体问题具体分析；"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "<div align=center><img src=\"http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/public/files/image/1095279501877/1594909490018_ZFUWhFUSbB.jpg\" width=\"50%\" height=\"50%\"></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **文本表示方法 Part1**\n",
    "\n",
    "在机器学习算法的训练过程中，假设给定$N$个样本，每个样本有$M$个特征，这样组成了$N×M$的样本矩阵，然后完成算法的训练和预测。同样的在计算机视觉中可以将图片的像素看作特征，每张图片看作hight×width×3的特征图，一个三维的矩阵来进入计算机进行计算。\n",
    "\n",
    "但是在自然语言领域，上述方法却不可行：文本是不定长度的。文本表示成计算机能够运算的数字或向量的方法一般称为词嵌入（Word Embedding）方法。词嵌入将不定长的文本转换到定长的空间内，是文本分类的第一步。\n",
    "\n",
    "#### **One-hot**\n",
    "\n",
    "这里的One-hot与数据挖掘任务中的操作是一致的，即将每一个单词使用一个离散的向量表示。具体将每个字/词编码一个索引，然后根据索引进行赋值。\n",
    "\n",
    "One-hot表示方法的例子如下：\n",
    "\n",
    "```python\n",
    "句子1：我 爱 北 京 天 安 门\n",
    "句子2：我 喜 欢 上 海\n",
    "```\n",
    "\n",
    "首先对所有句子的字进行索引，即将每个字确定一个编号：\n",
    "\n",
    "```python\n",
    "{\n",
    "\t'我': 1, '爱': 2, '北': 3, '京': 4, '天': 5,\n",
    "  '安': 6, '门': 7, '喜': 8, '欢': 9, '上': 10, '海': 11\n",
    "}\n",
    "```\n",
    "\n",
    "在这里共包括11个字，因此每个字可以转换为一个11维度稀疏向量：\n",
    "\n",
    "```\n",
    "我：[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n",
    "爱：[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n",
    "...\n",
    "海：[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]\n",
    "```\n",
    "\n",
    "#### **Bag of Words**\n",
    "\n",
    "Bag of Words（词袋表示），也称为Count Vectors，每个文档的字/词可以使用其出现次数来进行表示。\n",
    "\n",
    "```python\n",
    "句子1：我 爱 北 京 天 安 门\n",
    "句子2：我 喜 欢 上 海\n",
    "```\n",
    "\n",
    "直接统计每个字出现的次数，并进行赋值：\n",
    "\n",
    "```python\n",
    "句子1：我 爱 北 京 天 安 门\n",
    "转换为 [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]\n",
    "\n",
    "句子2：我 喜 欢 上 海\n",
    "转换为 [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]\n",
    "```\n",
    "\n",
    "在sklearn中可以直接`CountVectorizer`来实现这一步骤：\n",
    "\n",
    "```\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "corpus = [\n",
    "    'This is the first document.',\n",
    "    'This document is the second document.',\n",
    "    'And this is the third one.',\n",
    "    'Is this the first document?',\n",
    "]\n",
    "vectorizer = CountVectorizer()\n",
    "vectorizer.fit_transform(corpus).toarray()\n",
    "```\n",
    "\n",
    "#### **N-gram**\n",
    "\n",
    "N-gram与Count Vectors类似，不过加入了相邻单词组合成为新的单词，并进行计数。\n",
    "\n",
    "如果N取值为2，则句子1和句子2就变为：\n",
    "\n",
    "```\n",
    "句子1：我爱 爱北 北京 京天 天安 安门\n",
    "句子2：我喜 喜欢 欢上 上海\n",
    "```\n",
    "\n",
    "#### **TF-IDF**\n",
    "\n",
    "TF-IDF 分数由两部分组成：第一部分是**词语频率**（Term Frequency），第二部分是**逆文档频率**（Inverse Document Frequency）。其中计算语料库中文档总数除以含有该词语的文档数量，然后再取对数就是逆文档频率。\n",
    "\n",
    "```\n",
    "TF(t)= 该词语在当前文档出现的次数 / 当前文档中词语的总数\n",
    "IDF(t)= log_e（文档总数 / 出现该词语的文档总数）\n",
    "```\n",
    "\n",
    "### **基于机器学习的文本分类**\n",
    "\n",
    "接下来我们将对比不同文本表示算法的精度，通过本地构建验证集计算F1得分。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7422037924439758\n"
     ]
    }
   ],
   "source": [
    "# Count Vectors + RidgeClassifier\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.linear_model import RidgeClassifier\n",
    "from sklearn.metrics import f1_score\n",
    "\n",
    "train_df = pd.read_csv('../data/train_set.csv', sep='\\t', nrows=15000)\n",
    "\n",
    "vectorizer = CountVectorizer(max_features=3000)\n",
    "train_test = vectorizer.fit_transform(train_df['text'])\n",
    "\n",
    "clf = RidgeClassifier()\n",
    "clf.fit(train_test[:10000], train_df['label'].values[:10000])\n",
    "\n",
    "val_pred = clf.predict(train_test[10000:])\n",
    "print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))\n",
    "# 0.74"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8721598830546126\n"
     ]
    }
   ],
   "source": [
    "# TF-IDF +  RidgeClassifier\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.linear_model import RidgeClassifier\n",
    "from sklearn.metrics import f1_score\n",
    "\n",
    "train_df = pd.read_csv('../data/train_set.csv', sep='\\t', nrows=15000)\n",
    "\n",
    "tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)\n",
    "train_test = tfidf.fit_transform(train_df['text'])\n",
    "\n",
    "clf = RidgeClassifier()\n",
    "clf.fit(train_test[:10000], train_df['label'].values[:10000])\n",
    "\n",
    "val_pred = clf.predict(train_test[10000:])\n",
    "print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))\n",
    "# 0.87"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **本章小结**\n",
    "\n",
    "本章我们介绍了基于机器学习的文本分类方法，并完成了两种方法的对比。\n",
    "\n",
    "### **本章作业**\n",
    "\n",
    "1. 尝试改变TF-IDF的参数，并验证精度\n",
    "2. 尝试使用其他机器学习模型，完成训练和验证"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
